ROLLOUT ALGORITHMS FOR
STOCHASTIC SCHEDULING PROBLEMS*


by 

Dimitri  P.  Bertsekas1 2  and  David  A.  Castanon3 


Abstract 

Stochastic scheduling problems are difficult stochastic control problems with combinatorial
decision spaces. In this paper we focus on a class of stochastic scheduling problems, the quiz
problem and its variations. We discuss the use of heuristics for their solution, and we propose
rollout algorithms based on these heuristics, which approximate the stochastic dynamic programming
algorithm. We show how the rollout algorithms can be implemented efficiently, and we delineate
circumstances under which they are guaranteed to perform better than the heuristics on
which they are based. We also show computational results which suggest that the performance
of the rollout policies is near-optimal, and is substantially better than the performance of their
underlying heuristics.
2 Department of Electrical Engineering and Computer Science, M.I.T., Cambridge, Mass.,
02139.

3 Department of Electrical Engineering, Boston University, and ALPHATECH, Inc.,
Burlington, Mass., 01803.

1.  INTRODUCTION 


Consider  the  following  variation  of  a  planning  problem:  There  is  a  finite  set  of  locations  which 
contain  tasks  of  interest,  of  differing  value.  There  is  a  single  processor  on  which  the  tasks  are  to 
be  scheduled.  Associated  with  each  task  is  a  task-dependent  risk  that,  while  executing  that  task, 
the  processor  will  be  damaged  and  no  further  tasks  will  be  processed.  The  objective  is  to  find 
the  optimal  task  schedule  in  order  to  maximize  the  expected  value  of  the  completed  tasks. 

The  above  is  an  example  of  a  class  of  stochastic  scheduling  problems  known  in  the  literature 
as  quiz  problems  (see  Bertsekas  [1995],  Ross  [1983],  or  Whittle  [1982]).  The  simplest  form  of  this 
problem  involves  a  quiz  contest  where  a  person  is  given  a  list  of  N  questions  and  can  answer  these 
questions in any order he or she chooses. Question $i$ will be answered correctly with probability $p_i$,
and the person will then receive a reward $v_i$. At the first incorrect answer, the quiz terminates
and  the  person  is  allowed  to  keep  his  or  her  previous  rewards.  The  problem  is  to  choose  the 
ordering  of  questions  so  as  to  maximize  expected  rewards. 

The  problem  can  be  viewed  in  terms  of  dynamic  programming  (DP  for  short),  but  can 
more  simply  be  viewed  as  a  deterministic  combinatorial  problem,  whereby  we  are  seeking  an 
optimal  sequence  in  which  to  answer  the  questions.  It  is  well-known  that  the  optimal  sequence  is 
deterministic,  and  can  be  obtained  using  an  interchange  argument;  questions  should  be  answered 
in decreasing order of $p_i v_i/(1 - p_i)$. This will be referred to as the index policy. An answer order
is optimal if and only if it corresponds to an index policy. Another interesting simple policy
for the quiz problem is the greedy policy, which answers questions in decreasing order of their
expected reward $p_i v_i$. A greedy policy is generally suboptimal, essentially because it does not consider the
future  opportunity  loss  resulting  from  an  incorrect  answer. 
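
The difference between the two policies can be checked numerically on a small instance. The sketch below is illustrative only: the three-question instance and the function names (`expected_reward`, `index_order`, `greedy_order`) are assumptions made for this example and do not come from the paper. It compares the index and greedy orders against a brute-force search over all orders.

```python
from itertools import permutations

def expected_reward(order, p, v):
    """Expected reward when questions are attempted in the given order."""
    total, survive = 0.0, 1.0
    for i in order:
        survive *= p[i]            # probability that every answer so far was correct
        total += survive * v[i]    # reward v[i] is collected only in that event
    return total

def index_order(p, v):
    """Decreasing order of p_i v_i / (1 - p_i); assumes every p_i < 1."""
    return sorted(range(len(p)), key=lambda i: p[i] * v[i] / (1.0 - p[i]), reverse=True)

def greedy_order(p, v):
    """Decreasing order of the immediate expected reward p_i v_i."""
    return sorted(range(len(p)), key=lambda i: p[i] * v[i], reverse=True)

# Hypothetical three-question instance, chosen only for illustration.
p = [0.9, 0.5, 0.6]
v = [1.0, 4.0, 3.0]

# Brute force over all orders is feasible only for very small N.
best = max(permutations(range(len(p))), key=lambda order: expected_reward(order, p, v))
print("index :", index_order(p, v), expected_reward(index_order(p, v), p, v))
print("greedy:", greedy_order(p, v), expected_reward(greedy_order(p, v), p, v))
print("best  :", list(best), expected_reward(best, p, v))
```

On this instance the index order (0, 2, 1) attains the brute-force optimum of about 3.6, while the greedy order (1, 2, 0) yields about 3.17.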

Unfortunately,  with  only  minor  changes  in  the  structure  of  the  problem,  the  optimal  solution 
becomes  much  more  complicated  (although  DP  and  interchange  arguments  are  still  relevant). 
Examples  of  interesting  and  difficult  variations  of  the  problem  involve  one  or  more  of  the  following 
characteristics: 

(a)  A  limit  on  the  maximum  number  of  questions  that  can  be  answered,  which  is  smaller  than 
the  number  of  questions  N.  To  see  that  the  index  policy  is  not  optimal  anymore,  consider 
the  case  where  there  are  two  questions,  only  one  of  which  may  be  answered.  Then  it  is 
optimal  to  use  the  greedy  policy  rather  than  the  index  policy. 

(b)  A  time  window  for  each  question,  which  constrains  the  set  of  time  slots  when  each  question 
may  be  answered.  Time  windows  may  also  be  combined  with  the  option  to  refuse  answering 
a question at a given period, when either no question is available during the period, or
answering  any  one  of  the  available  questions  involves  excessive  risk. 

(c)  Precedence  constraints,  whereby  the  set  of  questions  that  can  be  answered  in  a  given  time 
slot  depends  on  the  immediately  preceding  question,  and  possibly  on  some  earlier  answered 
questions. 

(d) Sequence-dependent rewards, whereby the reward from correctly answering a given question
depends on the immediately preceding question, and possibly on some questions answered
earlier.

It  is  clear  that  the  quiz  problem  variants  listed  above  encompass  a  very  large  collection  of 
practical  scheduling  problems.  The  version  of  the  problem  with  time  windows  and  precedence 
constraints  relates  to  vehicle  routing  problems  (involving  a  single  vehicle).  The  version  of  the 
problem with sequence-dependent rewards, and a number of questions that is equal to the
maximum number of answers, relates to the traveling salesman problem. Thus, in general, it is very
difficult  to  solve  the  variants  described  above  exactly. 

An important feature of the quiz problem, which is absent in the classical versions of vehicle
routing and traveling salesman problems, is that there is a random mechanism for termination
of the quiz. Despite the randomness in the problem, however, in all of the preceding variants,
there is an optimal open-loop policy, i.e., an optimal order for the questions that does not depend
on the random outcome of the earlier questions. The reason is that we do not need to plan the
answer sequence following the event of an incorrect answer, because the quiz terminates when
this  event  occurs.  Thus,  we  refer  to  the  above  variations  of  the  quiz  problem  as  deterministic 
quiz  problems. 

There  are  variants  of  the  quiz  problem  where  the  optimal  order  to  answer  questions  depends 
on  random  events.  Examples  of  these  are: 

(e)  There  is  a  random  mechanism  by  which  the  quiz  taker  may  miss  a  turn,  i.e.,  be  denied  the 
opportunity  to  answer  a  question  at  a  given  period,  but  may  continue  answering  questions 
at  future  time  periods. 

(f)  New  questions  can  appear  and/or  old  questions  can  disappear  in  the  course  of  the  quiz 
according  to  some  random  mechanism.  A  similar  case  arises  when  the  start  and  end  of  the 
time  windows  can  change  randomly  during  the  quiz. 

(g)  There  may  be  multiple  quiz  takers  that  answer  questions  individually,  and  drop  out  of  the 
quiz  upon  their  own  first  error,  while  the  remaining  quiz  takers  continue  to  answer  questions. 
(h) The quiz taker may be allowed multiple chances, i.e., may continue answering questions up
to  a  given  number  of  errors. 

(i)  The  reward  for  answering  a  given  question  may  be  random  and  may  be  revealed  to  the  quiz 
taker  at  various  points  during  the  course  of  the  quiz. 

The variants (e)-(i) of the quiz problem described above require a genuinely stochastic
formulation as Markovian decision problems. We refer to these variations in the paper as stochastic
quiz problems. They can be solved exactly only with DP, but their optimal solution is
prohibitively difficult. This is because the states over which DP must be executed are subsets of
questions,  and  the  number  of  these  subsets  increases  exponentially  with  the  number  of  questions. 
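
The growth of the state space can be seen even in a simple case. The following is a minimal sketch, not the authors' implementation: it runs an exact DP over subsets of answered questions (encoded as bitmasks) for the deterministic variant (a), where at most M questions may be attempted, and reports how many subset states were evaluated. The stochastic variants (e)-(i) would need additional state components, which are omitted here. The function name and the instance are assumptions made for illustration.

```python
from functools import lru_cache

def dp_optimal_reward(p, v, M):
    """Exact DP for variant (a): returns (optimal reward, number of states evaluated)."""
    N = len(p)

    @lru_cache(maxsize=None)
    def J(answered):
        # `answered` is a bitmask: bit i is set if question i was already answered correctly.
        if bin(answered).count("1") == M:
            return 0.0                     # no attempts left
        best = 0.0
        for i in range(N):
            if not answered & (1 << i):
                # With probability p[i] we collect v[i] and continue; otherwise the quiz ends.
                best = max(best, p[i] * (v[i] + J(answered | (1 << i))))
        return best

    value = J(0)
    return value, J.cache_info().currsize

# Hypothetical instance, chosen only for illustration.
p = [0.9, 0.5, 0.6]
v = [1.0, 4.0, 3.0]
print(dp_optimal_reward(p, v, M=3))
```

With M = N every one of the 2^N subsets is visited, which is what makes exact DP impractical beyond small N.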

In  this  paper,  we  develop  suboptimal  solution  approaches  that  are  computationally  tractable 
for  both  deterministic  and  stochastic  quiz  problems.  In  particular,  we  focus  on  rollout  algorithms, 
a class of suboptimal solution methods inspired by the policy iteration methodology of DP and
the  approximate  policy  iteration  methodology  of  neuro-dynamic  programming  (NDP  for  short). 
One  may  view  a  rollout  algorithm  as  a  single  step  of  the  classical  policy  iteration  method,  starting 
from  some  given  easily  implementable  policy.  Algorithms  of  this  type  have  been  sporadically 
suggested  in  several  DP  application  contexts.  They  have  also  been  proposed  by  Tesauro  [1996] 
in  the  context  of  simulation-based  computer  backgammon  (the  name  “rollout”  was  introduced 
by  Tesauro  as  a  synonym  for  repeatedly  playing  out  a  given  backgammon  position  to  calculate 
by  Monte  Carlo  averaging  the  expected  game  score  starting  from  that  position). 

Rollout  algorithms  were  first  proposed  for  the  approximate  solution  of  discrete  optimization 
problems  by  Bertsekas  and  Tsitsiklis  [1996],  and  by  Bertsekas,  Tsitsiklis,  and  Wu  [1997],  and  the 
methodology  developed  here  for  the  quiz  problem  strongly  relates  to  the  ideas  in  these  sources. 
Generally,  rollout  algorithms  are  capable  of  magnifying  the  effectiveness  of  any  given  heuristic 
algorithm  through  sequential  application.  This  is  due  to  the  policy  improvement  mechanism  of 
the  underlying  policy  iteration  process. 

In  the  next  section,  we  introduce  rollout  algorithms  for  deterministic  quiz  problems,  where 
the  optimal  order  for  the  questions  from  a  given  period  onward  does  not  depend  on  earlier 
random  events.  In  Section  3,  we  provide  computational  results  indicating  that  rollout  algorithms 
can  improve  impressively  on  the  performance  of  their  underlying  heuristics.  In  Sections  4  and 
5,  we  extend  the  rollout  methodology  to  stochastic  quiz  problems  [cf.  variants  (e)-(i)  above], 
that  require  the  use  of  stochastic  DP  for  their  optimal  solution.  Here  we  introduce  the  idea 
of  multiple  scenaria  for  the  future  uncertainty  starting  from  a  given  state,  and  we  show  how 
these  scenaria  can  be  used  to  construct  an  approximation  to  the  optimal  value  function  of  the 
problem using NDP techniques and a process of scenario aggregation. In Section 6, we provide
computational results using rollout algorithms for stochastic quiz problems. Finally, in Section 7,
we provide computational results using rollout algorithms for quiz problems that involve
graph-based precedence constraints.


2.  ROLLOUT  ALGORITHMS  FOR  DETERMINISTIC  QUIZ  PROBLEMS 


Consider a variation of a quiz problem of the type described in (a)-(c) above. Let $N$ denote the
number of questions available, and let $M$ denote the maximum number of questions which may
be attempted. Associated with each question $i$ is a value $v_i$ and a probability $p_i$ of successfully
answering that question. Assume that there are constraints such as time windows or precedence
constraints which restrict the possible question orders. Denote by $V(i_1, \ldots, i_M)$ the expected
reward of a feasible question order $(i_1, \ldots, i_M)$:

$$V(i_1, \ldots, i_M) = p_{i_1}\bigl(v_{i_1} + p_{i_2}\bigl(v_{i_2} + p_{i_3}(\cdots + p_{i_M} v_{i_M}) \cdots \bigr)\bigr). \tag{2.1}$$

For an infeasible question order $(i_1, \ldots, i_M)$, we use the convention

$$V(i_1, \ldots, i_M) = -\infty.$$
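
As a minimal sketch of how Eq. (2.1) and the infeasibility convention might be evaluated in code, the nested expression can be computed from the innermost term outward. The feasibility predicate and the time-window data below are hypothetical and serve only to illustrate the convention.

```python
import math

def V(order, p, v, is_feasible=lambda order: True):
    """Expected reward (2.1) of a question order; -inf if the order is infeasible."""
    if not is_feasible(order):
        return -math.inf
    reward = 0.0
    for i in reversed(order):          # evaluate the nested form from the inside out
        reward = p[i] * (v[i] + reward)
    return reward

# Hypothetical instance and time windows (stage indices start at 0).
p = [0.9, 0.5, 0.6]
v = [1.0, 4.0, 3.0]
windows = {0: {2}, 1: {0, 1, 2}, 2: {0, 1, 2}}   # question 0 may only be attempted at stage 2

def within_windows(order):
    return all(stage in windows[i] for stage, i in enumerate(order))

print(V([0, 2, 1], p, v))                   # unconstrained evaluation
print(V([0, 2, 1], p, v, within_windows))   # question 0 at stage 0 violates its window
print(V([2, 1, 0], p, v, within_windows))   # feasible under the windows
```

The backward loop needs one multiplication and one addition per question, so each evaluation of (2.1) costs O(M) once the order is fixed.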


The  classical  quiz  problem  is  the  case  where  M  =  N,  and  all  question  orders  are  feasible. 
In  this  case,  the  optimal  solution  is  simply  obtained  by  using  an  interchange  argument.  Let  i 
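and $j$ be the $k$th and $(k+1)$st questions in an optimally ordered list

$$L = (i_1, \ldots, i_{k-1}, i, j, i_{k+2}, \ldots, i_N).$$

Consider the list

$$L' = (i_1, \ldots, i_{k-1}, j, i, i_{k+2}, \ldots, i_N)$$

obtained from $L$ by interchanging the order of questions $i$ and $j$. We compare the expected
rewards of $L$ and $L'$. We have

$$\begin{aligned}
E\{\text{reward of } L\} = {} & E\{\text{reward of } \{i_1, \ldots, i_{k-1}\}\} \\
& + p_{i_1} \cdots p_{i_{k-1}} (p_i v_i + p_i p_j v_j) \\
& + p_{i_1} \cdots p_{i_{k-1}} p_i p_j E\{\text{reward of } \{i_{k+2}, \ldots, i_N\}\},
\end{aligned}$$

$$\begin{aligned}
E\{\text{reward of } L'\} = {} & E\{\text{reward of } \{i_1, \ldots, i_{k-1}\}\} \\
& + p_{i_1} \cdots p_{i_{k-1}} (p_j v_j + p_j p_i v_i) \\
& + p_{i_1} \cdots p_{i_{k-1}} p_j p_i E\{\text{reward of } \{i_{k+2}, \ldots, i_N\}\}.
\end{aligned}$$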
Since $L$ is optimally ordered, we have

$$E\{\text{reward of } L\} \ge E\{\text{reward of } L'\},$$

so it follows from these equations that

$$p_i v_i + p_i p_j v_j \ge p_j v_j + p_j p_i v_i$$

or equivalently

$$\frac{p_i v_i}{1 - p_i} \ge \frac{p_j v_j}{1 - p_j}.$$

It follows that to maximize expected rewards, questions should be answered in decreasing order
of $p_i v_i/(1 - p_i)$, which yields the index policy.
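
For completeness, the algebra behind the last equivalence can be spelled out as follows, under the assumption that $p_i < 1$ and $p_j < 1$ so that the divisions are well defined:

```latex
\begin{align*}
p_i v_i + p_i p_j v_j \ge p_j v_j + p_j p_i v_i
  &\iff p_i v_i - p_i p_j v_i \ge p_j v_j - p_i p_j v_j \\
  &\iff p_i v_i (1 - p_j) \ge p_j v_j (1 - p_i) \\
  &\iff \frac{p_i v_i}{1 - p_i} \ge \frac{p_j v_j}{1 - p_j},
\end{align*}
```

where the last step divides both sides by the positive quantity $(1 - p_i)(1 - p_j)$.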

Unfortunately, the above argument breaks down when either $M < N$, or there are
constraints on the admissibility of sequences due to time windows, sequence-dependent constraints,
or  precedence  constraints.  For  these  cases,  we  can  still  use  heuristics  such  as  the  index  policy  or 
the  greedy  policy,  but  they  will  not  provide  optimal  performance. 

Consider a heuristic algorithm, which given a partial schedule $P = (i_1, \ldots, i_k)$ of distinct
questions constructs a complementary schedule $\bar{P} = (i_{k+1}, \ldots, i_M)$ of distinct questions such that
$P \cap \bar{P} = \emptyset$. The heuristic algorithm is referred to as the base heuristic. We define the heuristic
reward of the partial schedule $P$ as

$$H(P) = V(i_1, \ldots, i_k, i_{k+1}, \ldots, i_M). \tag{2.2}$$

If $P = (i_1, \ldots, i_M)$ is a complete solution, by convention the heuristic reward of $P$ is the true
expected reward $V(i_1, \ldots, i_M)$.

Given  a  base  heuristic,  the  corresponding  rollout  algorithm  constructs  a  complete  schedule 
in  M  stages,  one  question  per  stage.  The  rollout  algorithm  can  be  described  as  follows: 

At the 1st stage it selects question $i_1$ according to

$$i_1 = \arg\max_{i = 1, \ldots, N} H(i), \tag{2.3}$$

and at the $k$th stage ($k > 1$) it selects $i_k$ according to

$$i_k = \arg\max_{i \ne i_1, \ldots, i_{k-1}} H(i_1, \ldots, i_{k-1}, i), \qquad k = 2, \ldots, M. \tag{2.4}$$

Thus a rollout policy involves $N + (N - 1) + \cdots + (N - M) = O(MN)$ applications of the
base heuristic and corresponding calculations of expected reward of the form (2.1). While this
is a significant increase over the calculations required to apply the base heuristic and compute
its  expected  reward,  the  rollout  policy  is  still  computationally  tractable.  In  particular,  if  the 
running  time  of  the  base  heuristic  is  polynomial,  so  is  the  running  time  of  the  corresponding 
rollout  algorithm.  On  the  other  hand,  it  will  be  shown  shortly  that  the  expected  reward  of  the 
rollout  policy  is  at  least  as  large  as  the  one  of  the  base  heuristic. 

As  an  example  of  a  rollout  algorithm,  consider  the  special  variant  (a)  of  the  quiz  problem 
in  the  preceding  section,  where  at  most  M  out  of  N  questions  may  be  answered  and  there  are  no 
time windows or other complications. Let us use as base heuristic the index heuristic, which given
a partial schedule $(i_1, \ldots, i_k)$, attempts the remaining questions according to the index policy, in
decreasing order of $p_i v_i/(1 - p_i)$. The calculation of $H(i_1, \ldots, i_k)$ is done using Eq. (2.1), once the
questions have been sorted in decreasing order of index. The corresponding rollout algorithm,
given $(i_1, \ldots, i_{k-1})$, calculates $H(i_1, \ldots, i_{k-1}, i)$ for all $i \ne i_1, \ldots, i_{k-1}$, using Eq. (2.1),
and then optimizes this expression over $i$ to select $i_k$.
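
The rollout procedure for this example can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the three-question instance are hypothetical, every $p_i$ is assumed to be less than 1, and no time windows or precedence constraints are modeled.

```python
def expected_reward(order, p, v):
    """Eq. (2.1), evaluated from the innermost term outward."""
    reward = 0.0
    for i in reversed(order):
        reward = p[i] * (v[i] + reward)
    return reward

def index_heuristic(partial, p, v, M):
    """Complete `partial` up to M questions in decreasing order of p_i v_i / (1 - p_i)."""
    remaining = [i for i in range(len(p)) if i not in partial]
    remaining.sort(key=lambda i: p[i] * v[i] / (1.0 - p[i]), reverse=True)
    return list(partial) + remaining[: M - len(partial)]

def H(partial, p, v, M):
    """Heuristic reward of a partial schedule, as in Eq. (2.2)."""
    return expected_reward(index_heuristic(partial, p, v, M), p, v)

def rollout(p, v, M):
    """One-step rollout, Eqs. (2.3)-(2.4), with the index heuristic as base heuristic."""
    schedule = []
    for _ in range(M):
        candidates = [i for i in range(len(p)) if i not in schedule]
        schedule.append(max(candidates, key=lambda i: H(schedule + [i], p, v, M)))
    return schedule

# Hypothetical instance: three questions, at most two attempts.
p = [0.9, 0.5, 0.6]
v = [1.0, 4.0, 3.0]
M = 2
base = index_heuristic([], p, v, M)
rolled = rollout(p, v, M)
print("index heuristic:", base, expected_reward(base, p, v))
print("rollout        :", rolled, expected_reward(rolled, p, v))
```

On this instance the index heuristic produces the schedule (0, 2) with expected reward about 2.52, and the rollout produces (0, 1) with expected reward about 2.7. The optimal two-question schedule, (2, 1) with reward about 3.0, is not found, which illustrates that rollout improves on its base heuristic without being guaranteed optimal.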

Note that one may use a different heuristic, such as the greedy heuristic, in place of the
index heuristic. There are also other possibilities for base heuristics. For example, one may
first construct a complementary schedule using the index heuristic, and then try to improve this
schedule by using a 2-OPT local search heuristic, that involves exchanges of positions of pairs of
questions. One may also use multiple heuristics, which produce heuristic values $H_j(i_1, \ldots, i_k)$,
$j = 1, \ldots, J$, of a generic partial schedule $(i_1, \ldots, i_k)$, and then combine them into a “superheuristic”
that gives the maximal value

$$H(i_1, \ldots, i_k) = \max_{j = 1, \ldots, J} H_j(i_1, \ldots, i_k).$$
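
A minimal sketch of such a superheuristic, combining the index and greedy heuristics, is given below. The helper names (`make_ranking_heuristic`, `superheuristic_value`) and the instance are assumptions made for illustration only.

```python
def expected_reward(order, p, v):
    """Eq. (2.1), evaluated from the innermost term outward."""
    reward = 0.0
    for i in reversed(order):
        reward = p[i] * (v[i] + reward)
    return reward

def make_ranking_heuristic(score, n_questions, M):
    """Build a base heuristic that completes a partial schedule in decreasing order of score(i)."""
    def complete(partial):
        rest = sorted((i for i in range(n_questions) if i not in partial),
                      key=score, reverse=True)
        return list(partial) + rest[: M - len(partial)]
    return complete

def superheuristic_value(partial, heuristics, p, v):
    """H(partial) = max over j of H_j(partial)."""
    return max(expected_reward(h(partial), p, v) for h in heuristics)

# Hypothetical instance: three questions, at most two attempts.
p = [0.9, 0.5, 0.6]
v = [1.0, 4.0, 3.0]
M = 2
index_h = make_ranking_heuristic(lambda i: p[i] * v[i] / (1.0 - p[i]), len(p), M)
greedy_h = make_ranking_heuristic(lambda i: p[i] * v[i], len(p), M)

print(expected_reward(index_h([]), p, v))                  # index heuristic alone
print(expected_reward(greedy_h([]), p, v))                 # greedy heuristic alone
print(superheuristic_value([], [index_h, greedy_h], p, v)) # the larger of the two
```

Here the greedy completion (1, 2) is worth about 2.9 and the index completion (0, 2) about 2.52, so the superheuristic value at the empty schedule is about 2.9.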

An  important  question  is  whether  the  rollout  algorithm  performs  at  least  as  well  as  its 
base  heuristic  when  started  from  the  initial  partial  schedule.  This  can  be  guaranteed  if  the  base 
heuristic  is  sequentially  consistent.  By  this  we  mean  that  the  heuristic  has  the  following  property: 

Suppose that starting from a partial schedule

$$P = (i_1, \ldots, i_{k-1}),$$

the heuristic produces the complementary schedule

$$\bar{P} = (i_k, \ldots, i_M).$$

Then starting from the partial schedule

$$P^+ = (i_1, \ldots, i_{k-1}, i_k),$$

the heuristic produces the complementary schedule

$$\bar{P}^+ = (i_{k+1}, \ldots, i_M).$$

As  an  example,  it  can  be  seen  that  the  index  and  the  greedy  heuristics,  discussed  earlier, 
are  sequentially  consistent.  This  is  a  manifestation  of  a  more  general  property:  many  common 
base  heuristics  of  the  greedy  type  are  by  nature  sequentially  consistent.  It  may  be  verified,  based 
on Eq. (2.4), that a sequentially consistent rollout algorithm keeps generating the same schedule
$P \cup \bar{P}$, up to the point where by examining the alternatives in Eq. (2.4) and by calculating their
heuristic rewards, it discovers a better schedule. As a result, sequential consistency guarantees
that the reward of the schedules $P \cup \bar{P}$ produced by the rollout algorithm is monotonically
nondecreasing; that is, we have

$$H(P^+) \ge H(P)$$

at  every  stage.  For  further  elaboration  of  the  sequential  consistency  property,  we  refer  to  the 
paper  by  Bertsekas,  Tsitsiklis,  and  Wu  [1997],  which  also  discusses  some  underlying  connections 
with  the  policy  iteration  method  of  dynamic  programming. 
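
In the reward-maximization setting of this section, the monotonicity can be verified in one line. Let $P = (i_1, \ldots, i_{k-1})$, let $P^+$ be the partial schedule selected by the rollout rule (2.4), and let $\bar{\imath}_k$ denote the first question of the complementary schedule that the base heuristic produces from $P$ (the symbol $\bar{\imath}_k$ is introduced here only for this derivation). Sequential consistency gives $H(i_1, \ldots, i_{k-1}, \bar{\imath}_k) = H(P)$, so

```latex
\begin{align*}
H(P^+) \;=\; \max_{i \ne i_1, \ldots, i_{k-1}} H(i_1, \ldots, i_{k-1}, i)
       \;\ge\; H(i_1, \ldots, i_{k-1}, \bar{\imath}_k)
       \;=\; H(P).
\end{align*}
```

Applying this inequality at stages $1, \ldots, M$ shows that the reward of the final rollout schedule is at least the reward obtained by running the base heuristic from the empty schedule.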

A condition that is more general than sequential consistency is that the algorithm be
sequentially improving, in the sense that at each stage there holds

$$H(P^+) \ge H(P).$$

This property also guarantees that the rewards of the schedules produced by the rollout algorithm
are monotonically nondecreasing. The paper by Bertsekas, Tsitsiklis, and Wu [1997] discusses
situations  where  this  property  holds,  and  shows  that  with  fairly  simple  modification,  a  rollout 
algorithm  can  be  made  sequentially  improving. 

There are a number of variations of the basic rollout algorithm described above. In
particular, we may incorporate multistep lookahead or selective depth lookahead into the rollout
framework. An example of a rollout algorithm with $m$-step lookahead operates as follows: at the
$k$th stage we augment the current partial schedule $P = (i_1, \ldots, i_{k-1})$ with all possible sequences
of $m$ questions $i \ne i_1, \ldots, i_{k-1}$. We run the base heuristic from each of the corresponding
augmented partial schedules, we select the $m$-question sequence with maximum heuristic reward,
and then augment the current partial schedule $P$ with the first question in this sequence. An
example of a rollout algorithm with selective two-step lookahead operates as follows: at the $k$th
stage we start with the current partial schedule $P = (i_1, \ldots, i_{k-1})$, and we run the base heuristic
starting from each partial schedule $(i_1, \ldots, i_{k-1}, i)$ with $i \ne i_1, \ldots, i_{k-1}$. We then form the subset
$I$ consisting of the $n$ questions $i \ne i_1, \ldots, i_{k-1}$ that correspond to the $n$ best complete schedules thus
obtained. We run the base heuristic starting from each of the partial schedules $(i_1, \ldots, i_{k-1}, i, j)$
with $i \in I$ and $j \ne i_1, \ldots, i_{k-1}, i$, and obtain a corresponding complete schedule. We then select
as next question $i_k$ of the rollout schedule the question $i \in I$ that corresponds to a maximal
reward schedule. Note that by choosing the number $n$ to be smaller than the maximum possible,
$N - k + 1$, we can reduce substantially the computational requirements of the two-step lookahead.
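
The selective two-step lookahead can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: it uses the index heuristic as base heuristic, ignores time windows and precedence constraints, assumes $M \le N$ and $p_i < 1$, and falls back to the one-step value when the schedule has no room for a second lookahead question. All names and the instance are hypothetical.

```python
def expected_reward(order, p, v):
    """Eq. (2.1), evaluated from the innermost term outward."""
    reward = 0.0
    for i in reversed(order):
        reward = p[i] * (v[i] + reward)
    return reward

def index_completion(partial, p, v, M):
    """Complete `partial` up to M questions in decreasing order of p_i v_i / (1 - p_i)."""
    rest = [i for i in range(len(p)) if i not in partial]
    rest.sort(key=lambda i: p[i] * v[i] / (1.0 - p[i]), reverse=True)
    return list(partial) + rest[: M - len(partial)]

def H(partial, p, v, M):
    return expected_reward(index_completion(partial, p, v, M), p, v)

def selective_two_step_rollout(p, v, M, n):
    schedule = []
    for _ in range(M):
        free = [i for i in range(len(p)) if i not in schedule]
        # One-step rollout values are used only to shortlist the n most promising questions.
        shortlist = sorted(free, key=lambda i: H(schedule + [i], p, v, M), reverse=True)[:n]

        def two_step_value(i):
            if len(schedule) + 2 > M:      # no room for a second lookahead question
                return H(schedule + [i], p, v, M)
            return max(H(schedule + [i, j], p, v, M) for j in free if j != i)

        schedule.append(max(shortlist, key=two_step_value))
    return schedule

# Hypothetical instance: three questions, at most two attempts.
p = [0.9, 0.5, 0.6]
v = [1.0, 4.0, 3.0]
for n in (2, 3):
    s = selective_two_step_rollout(p, v, M=2, n=n)
    print(n, s, expected_reward(s, p, v))
```

On this instance a shortlist of n = 2 yields the schedule (1, 2) with reward about 2.9, while n = 3 yields (2, 1) with reward about 3.0. The smaller shortlist saves heuristic evaluations at the price of possibly discarding the best first question.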


3.  COMPUTATIONAL  EXPERIMENTS  WITH  DETERMINISTIC  QUIZ  PROBLEMS 


In order to explore the performance of rollout algorithms for deterministic scheduling, we
conducted a series of computational experiments involving the following seven algorithms:

(1)  The  optimal  stochastic  dynamic  programming  algorithm. 

(2) The greedy heuristic, where questions are ranked in order of decreasing $p_i v_i$, and, for each
stage $k$, the feasible unanswered question with the highest ranking is selected.

(3) The index heuristic, where questions are ranked in order of decreasing $p_i v_i/(1 - p_i)$, and
for each stage $k$, the feasible unanswered question with the highest ranking is selected.

(4) The one-step rollout policy based on the greedy heuristic, where, at each stage $k$, for
every feasible unanswered question $i_k$ and prior sequence $i_1, \ldots, i_{k-1}$, the question is chosen
according to the rollout rule (2.4), where the function $H$ uses the greedy heuristic as the
base policy.

(5) The one-step rollout policy based on the index heuristic, where the function $H$ in (2.4) uses
the index heuristic as the base policy.

(6) The selective two-step lookahead rollout policy based on the greedy heuristic. At the $k$th
stage,  the  base  heuristic  is  used  in  a  one-step  rollout  to  select  the  best  four  choices  for  the 
current  question  among  the  admissible  choices.  For  each  of  these  choices  at  stage  k,  the 
feasible  continuations  at  stage  k  +  1  are  evaluated  using  the  greedy  heuristic  to  complete 
the  schedule.  The  choice  at  stage  k  is  then  selected  from  the  sequence  with  the  highest 
evaluation. 

(7)  The  selective  two-step  lookahead  rollout  policy  based  on  the  index  heuristic. 

The  problems  selected  for  evaluation  involve  20  possible  questions  and  20  stages,  which  are 
small  enough  so  that  exact  solution  using  dynamic  programming  is  possible.  Associated  with  each 
question is a sequence of times, determined randomly for each experiment, when that question
can  be  attempted.  Floating  point  values  were  assigned  randomly  to  each  question  from  1  to  10  in 
each  problem  instance.  The  probabilities  of  successfully  answering  each  question  were  also  chosen 
randomly,  between  a  specified  lower  bound  and  1.0.  In  order  to  evaluate  the  performance  of  the 
last  six  algorithms,  each  suboptimal  algorithm  was  simulated  10,000  times,  using  independent 
event  sequences  determining  which  questions  were  answered  correctly. 
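
The Monte Carlo evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' experimental code: the generation of time windows is omitted because its details are not specified here, uniform sampling of values and probabilities is an assumption, and the function names are hypothetical.

```python
import random

def random_instance(n_questions, p_lower, rng):
    """Draw values in [1, 10] and success probabilities in [p_lower, 1.0] (uniform by assumption)."""
    values = [rng.uniform(1.0, 10.0) for _ in range(n_questions)]
    probs = [rng.uniform(p_lower, 1.0) for _ in range(n_questions)]
    return probs, values

def simulate_schedule(order, p, v, runs, rng):
    """Average realized reward of a fixed question order over independent simulated quizzes."""
    total = 0.0
    for _ in range(runs):
        for i in order:
            if rng.random() >= p[i]:   # incorrect answer: the quiz terminates
                break
            total += v[i]              # correct answer: collect the reward and continue
    return total / runs

rng = random.Random(0)                 # fixed seed so that the sketch is reproducible
p, v = random_instance(20, 0.2, rng)
order = sorted(range(20), key=lambda i: p[i] * v[i], reverse=True)   # greedy order
print(simulate_schedule(order, p, v, runs=10_000, rng=rng))
```

For a fixed open-loop order the simulated average converges to the closed-form value (2.1), so simulation is mainly useful as a check, or for policies whose expected reward has no convenient closed form.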

Our  experiments  focused  on  the  effects  of  two  factors  on  the  relative  performance  of  the 
different  algorithms: 

(a) The lower bound on the probability of successfully answering a question, which varied from
0.2 to 0.8.

(b)  The  average  percent  of  questions  that  are  admissible  (i.e.,  that  can  be  answered)  at  any  one 
stage,  which  ranged  from  10%  to  50%. 

The  first  set  of  experiments  fixed  the  average  percentage  of  questions  which  can  be  answered 
at  a  single  stage  to  10%,  and  varied  the  lower  bound  on  the  probability  of  successfully  answering 
a question across four conditions: 0.2, 0.4, 0.6 and 0.8. For each experimental condition, we
generated 30 independent problems and solved them, and evaluated the corresponding performance
using  10,000  Monte  Carlo  runs.  We  computed  the  average  performance  across  the  30  problems, 
and  compared  this  performance  with  the  performance  obtained  using  the  stochastic  dynamic 
programming  algorithm. 

Table 1 shows the results of our experiments. The average performance of the greedy and
index heuristics in each condition is expressed in terms of the percentage of the optimal
performance. For low probability of success, both heuristics obtain less than half of the performance
of the optimal algorithm. The table also illustrates the improvement in performance obtained by
both the one-step rollout and the selective two-step rollout algorithms, expressed in terms of
percentage of the optimal performance. As an example, the first column of Table 1 gives the average
performance across 30 problems with lower bound on the probability of successfully answering a
question 0.2. The performance achieved by the greedy heuristic was 41% of optimal, whereas the
one-step rollout with the greedy heuristic as a base policy achieved
on average 75% of the optimal performance, an improvement of 34 percentage points. Furthermore, the
two-step selective rollout achieved on average 81% of the optimal performance.

Minimum Probability of Success       0.2    0.4    0.6    0.8

Greedy Heuristic                     41%    50%    61%    76%
  Improvement by One-step Rollout    34%    32%    27%    14%
  Improvement by Two-step Rollout    40%    34%    27%    14%

Index Heuristic                      43%    53%    66%    80%
  Improvement by One-step Rollout    34%    30%    23%    10%
  Improvement by Two-step Rollout    38%    33%    24%    11%

Table 1: Performance of the different algorithms as the minimum probability
of success of answering a question varies. The average percentage of questions
which can be answered at a single stage is fixed at 10%. The numbers reported
are percentage of the performance of the optimal dynamic programming solution
achieved, averaged across 30 independent problems. As an example, the first
column gives the average performance across 30 problems with lower bound on
the probability of successfully answering a question 0.2. The performance achieved
by the greedy heuristic was 41% of optimal, whereas the average performance of
the one-step rollout with the greedy heuristic as a base policy achieved on average
75% of the optimal performance, which was a 34% improvement.


The results in Table 1 show that one-step rollouts significantly improve the performance of
both the greedy and the index heuristics in these difficult stochastic combinatorial problems. In
particular, the rollout algorithms recovered in all cases at least 50% of the loss of value due to
the use of the heuristic. Loss recovery of this order or better was typical in all of the experiments
with rollout algorithms in this paper. The results also illustrate that the performance of the
simple heuristics improves as the average probability of success increases, thereby reducing the
potential advantage of rollout strategies. Even in these unfavorable cases, the rollout strategies
improved performance levels by at least 10% of the optimal policy, and recovered a substantial
portion of the loss due to the suboptimality of the heuristic.
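
The loss-recovery figures can be recomputed directly from the entries of Table 1. In the short sketch below, the fraction of loss recovered is taken to be the rollout improvement divided by the heuristic's gap to optimal, (100 - heuristic percentage); this reading of "loss recovery" and the variable names are assumptions made for illustration.

```python
# Entries of Table 1, in percent of optimal, for minimum success probabilities 0.2, 0.4, 0.6, 0.8.
table1 = {
    "greedy": {"base": [41, 50, 61, 76], "one_step": [34, 32, 27, 14], "two_step": [40, 34, 27, 14]},
    "index":  {"base": [43, 53, 66, 80], "one_step": [34, 30, 23, 10], "two_step": [38, 33, 24, 11]},
}

def loss_recovered(base, improvement):
    """Fraction of the gap to optimal (100 - base) that is closed by the rollout improvement."""
    return improvement / (100 - base)

recovery = {
    (name, kind): [loss_recovered(b, d) for b, d in zip(rows["base"], rows[kind])]
    for name, rows in table1.items()
    for kind in ("one_step", "two_step")
}

for key, fractions in recovery.items():
    print(key, [round(f, 2) for f in fractions])
print("minimum recovery:", min(min(f) for f in recovery.values()))
```

The smallest value is 0.5, attained by the one-step rollout on the index heuristic at minimum success probability 0.8 (an improvement of 10 points on a gap of 20), which matches the statement that at least 50% of the loss was recovered in all cases.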

For  the  size  of  problems  tested  in  these  experiments,  the  advantages  of  using  a  two-step 
selective  lookahead  rollout  were  small.  In  many  cases,  the  performances  of  the  one-step  rollout 
and  the  two-step  selective  lookahead  rollout  were  identical.  Nevertheless,  for  selected  difficult 
individual  problems,  the  two-step  selective  lookahead  rollout  improved  performance  by  as  much 
as  40%  of  the  optimal  strategy  over  the  level  achieved  by  the  one-step  rollout  with  the  same  base 
heuristic. 

The  second  set  of  experiments  fixed  the  lower  bound  on  the  probability  of  successfully 
answering a question to 0.2, and varied the average percent of admissible questions at any one
stage  across  3  levels:  10%,  30%  and  50%.  As  before,  we  generated  30  independent  problems 
and  evaluated  the  performance  of  each  algorithm  on  each  problem  instance.  The  results  of  these 
experiments  are  summarized  in  Table  2.  As  before,  the  performance  of  the  greedy  and  index 
heuristics  improves  as  the  experimental  condition  approaches  the  standard  conditions  of  the  quiz 
problem,  where  100%  of  the  questions  can  be  answered  at  any  time.  The  results  confirm  the  trend 
seen  in  Table  1:  even  in  cases  where  the  heuristics  achieve  good  performance,  rollout  strategies 
offer  significant  performance  gains. 


Problem Density                      0.1    0.3    0.5

Greedy Heuristic                     41%    58%    76%
  Improvement by One-step Rollout    34%    28%    15%
  Improvement by Two-step Rollout    40%    32%    16%

Index Heuristic                      43%    68%    85%
  Improvement by One-step Rollout    34%    22%     8%
  Improvement by Two-step Rollout    38%    24%     9%

Table  2:  Performance  of  the  different  algorithms  as  the  average  number  of 
questions  per  period  increases.  The  lower  bound  on  the  probability  of  successfully 
answering  a  question  is  fixed  at  0.2.  The  numbers  reported  are  percentage  of  the 
performance  of  the  optimal  dynamic  programming  solution  achieved,  averaged 
across  30  independent  problems. 


The  results  in  Tables  1  and  2  suggest  that  the  advantage  of  rollout  strategies  over  the 
greedy  and  index  heuristics  increases  with  the  risk  involved  in  the  problem.  This  advantage 
stems  from  the  forward-looking  character  of  rollout  strategies.  In  particular,  by  constructing 
a  feasible  strategy  for  the  entire  horizon  for  evaluating  the  current  decision,  rollout  strategies 
account  for  the  limited  future  accessibility  of  questions,  and  compute  tradeoffs  between  future 
accessibility  and  the  risk  of  the  current  choice.  In  contrast,  myopic  strategies  such  as  the  greedy 
and  index  heuristics  do  not  account  for  future  access  to  questions,  and  thus  are  forced  to  make 
risky choices when no other alternatives are present. Thus, as the risk of missing a question
increases and the average accessibility of questions decreases, rollout strategies achieve nearly
double the performance of the corresponding myopic heuristics.


4.  ROLLOUT  ALGORITHMS  FOR  STOCHASTIC  QUIZ  PROBLEMS 


We  now  consider  variants  of  the  quiz  problem  where  there  is  no  optimal  policy  that  is  open-loop. 
The  situations  (e)-(i)  given  in  Section  1  provide  examples  of  quiz  problems  of  this  type.  We  can 
view  such  problems  as  stochastic  DP  problems.  Their  exact  solution,  however,  is  prohibitively 
expensive. 

Let us state a quiz problem in the basic form of a dynamic programming problem (see e.g.,
Bertsekas [1995]), where we have the stationary discrete-time dynamic system

$$x_{k+1} = f_k(x_k, u_k, w_k), \qquad k = 0, 1, \ldots, T, \tag{4.1}$$

that evolves over $T$ time periods. Here $x_k$ is the state taking values in some set, $u_k$ is the
control to be selected from a finite set $U_k(x_k)$, $w_k$ is a random disturbance, and $f_k$ is a given
function. We assume that the disturbance $w_k$, $k = 0, 1, \ldots$, has a given probability distribution
that depends explicitly only on the current state and control. The one-stage cost function is
denoted by $g_k(x, u, w)$. In this general framework, we assume that costs are minimized, but the
following discussion can be easily adapted to the case where rewards are maximized.


To apply the rollout framework, we need to have a base policy for making a decision
at each state-time pair $(x_k, k)$. We view this policy as a sequence of feedback functions
$\pi = \{\mu_0, \mu_1, \ldots, \mu_{T-1}\}$, which at time $k$ maps a state $x_k$ to a control
$\mu_k(x_k) \in U_k(x_k)$. The cost-to-go of $\pi$ starting from a state-time pair $(x_k, k)$ will be
denoted by

$$J_k(x_k) = E\left\{ \sum_{i=k}^{T-1} g_i\bigl(x_i, \mu_i(x_i), w_i\bigr) \right\}. \tag{4.2}$$

The cost-to-go functions $J_k$ satisfy the following recursion of dynamic programming (DP for
short)

$$J_k(x) = E\bigl\{ g\bigl(x, \mu_k(x), w\bigr) + J_{k+1}\bigl(f(x, \mu_k(x), w)\bigr) \bigr\}, \qquad k = 0, 1, \ldots \tag{4.3}$$

with the initial condition

$$J_T(x) = 0.$$
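As a minimal sketch of the recursion (4.3), the following Python code evaluates a base policy backward in time on a small finite problem. The two-state toy problem and all names (`evaluate_policy`, `f`, `g`, `mu`, `disturbances`) are illustrative assumptions and do not come from the paper.

```python
def evaluate_policy(states, T, f, g, mu, disturbances):
    """Return J with J[k][x] = cost-to-go of the base policy mu from state x at time k.

    disturbances(x, u) returns (w, probability) pairs, so the distribution may
    depend on the current state and control, as assumed in the text.
    """
    J = [dict() for _ in range(T + 1)]
    for x in states:
        J[T][x] = 0.0  # terminal condition J_T(x) = 0
    for k in range(T - 1, -1, -1):  # backward in time, cf. Eq. (4.3)
        for x in states:
            u = mu(k, x)
            J[k][x] = sum(
                prob * (g(x, u, w) + J[k + 1][f(x, u, w)])
                for w, prob in disturbances(x, u)
            )
    return J


# Hypothetical toy problem: state 0 = "not done", state 1 = "done".
# Each attempt from state 0 costs 1 and succeeds (w = 1) with probability 0.5.
def f(x, u, w):
    return 1 if x == 1 or (u == 1 and w == 1) else 0


def g(x, u, w):
    return 1.0 if x == 0 and u == 1 else 0.0


def mu(k, x):
    return 1  # base policy: always attempt


def disturbances(x, u):
    return [(0, 0.5), (1, 0.5)]


J = evaluate_policy([0, 1], 2, f, g, mu, disturbances)
print(J[0][0])  # cost-to-go of the base policy from state 0 over two periods
```

The table `J` grows with the number of states, which is why this exact evaluation is only practical for small problems.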


The rollout policy based on $\pi$ is denoted by $\bar\pi = \{\bar\mu_0, \bar\mu_1, \ldots, \bar\mu_{T-1}\}$,
and is defined through the operation

$$\bar\mu_k(x) = \arg\min_{u \in U(x)} E\bigl\{ g(x, u, w) + J_{k+1}\bigl(f(x, u, w)\bigr) \bigr\}, \qquad \forall\ x,\ k = 0, 1, \ldots \tag{4.4}$$

Thus the rollout policy is a one-step lookahead policy, with the optimal cost-to-go approximated
by the cost-to-go of the base policy. This amounts essentially to a single step of the method of
policy iteration. Indeed, using standard policy iteration arguments, one can show that the rollout
policy $\bar\pi$ is an improved policy over the base policy $\pi$.

In practice, one typically has a method or algorithm to compute the control $\mu_k(x)$ of the
base policy, given the state $x$, but the corresponding cost-to-go functions $J_k$ may not be known
in closed form. Then the exact or approximate computation of the rollout control $\bar\mu_k(x)$ using
Eq. (4.4) becomes an important and nontrivial issue, since we need for all $u \in U(x)$ the value of

$$Q_k(x, u) = E\bigl\{ g(x, u, w) + J_{k+1}\bigl(f(x, u, w)\bigr) \bigr\}, \tag{4.5}$$

known as the Q-factor at time $k$. Alternatively, for the computation of $\bar\mu_k(x)$ we need the value
of the cost-to-go

$$J_{k+1}\bigl(f(x, u, w)\bigr)$$

at all possible next states $f(x, u, w)$.
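When the cost-to-go of the base policy is available at every next state, the Q-factors (4.5) and the minimization (4.4) reduce to a few lines. The sketch below assumes a hypothetical two-control example ("safe" versus "risky") and an assumed table `J_next` for the base policy. The function names are illustrative only.

```python
def q_factor(x, u, f, g, disturbances, J_next):
    """Q_k(x, u) = E{ g(x, u, w) + J_{k+1}(f(x, u, w)) }, cf. Eq. (4.5)."""
    return sum(p * (g(x, u, w) + J_next[f(x, u, w)]) for w, p in disturbances(x, u))


def rollout_control(x, controls, f, g, disturbances, J_next):
    """Minimize the Q-factor over the admissible controls, cf. Eq. (4.4)."""
    q = {u: q_factor(x, u, f, g, disturbances, J_next) for u in controls}
    return min(q, key=q.get), q


# Hypothetical example: "safe" (u = 0) costs 2 and always reaches state 1;
# "risky" (u = 1) costs 1 and reaches state 1 only when w = 1 (probability 0.5).
def f(x, u, w):
    return 1 if x == 1 or u == 0 or w == 1 else 0


def g(x, u, w):
    if x == 1:
        return 0.0
    return 2.0 if u == 0 else 1.0


def disturbances(x, u):
    return [(0, 0.5), (1, 0.5)]


J_next = {0: 5.0, 1: 0.0}  # assumed cost-to-go of the base policy at time k + 1
best_u, q = rollout_control(0, [0, 1], f, g, disturbances, J_next)
print(best_u, q)
```

Here the risky control looks cheaper for one stage, but its Q-factor is larger because failure leads to a state where the base policy is expensive.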

In favorable cases, it is possible to compute the cost-to-go $J_k(x)$ of the base policy $\pi$ for
any time $k$ and state $x$. An example is the variant of the quiz problem discussed in Sections 2
and 3, where the base policy is an open-loop policy that consists of the schedule generated by
the index policy or the greedy policy. The corresponding cost-to-go can then be computed using
Eq. (2.1). In general, however, the computation of the cost-to-go of the base policy may be much
more difficult. In particular, when the number of states is very large, the DP recursion (4.3) may
be infeasible.

A conceptually straightforward approach for computing the rollout control at a given state
$x$ and time $k$ is to use Monte Carlo simulation. This was proposed by Tesauro [TeG96] in the
context of backgammon. In particular, for a given backgammon position and a given roll of
the dice, Tesauro suggested looking at all possible ways to play the given roll, and doing a Monte
Carlo evaluation of the expected score starting from the resulting position and using some base
computer program to play out the game (for both sides). To implement this approach in the
context of a general DP problem, we consider all possible controls $u \in U(x)$ and we generate
a “large” number of simulated trajectories of the system starting from $x$, using $u$ as the first
control, and using the policy $\pi$ thereafter. Thus a simulated trajectory has the form

$$x_{i+1} = f\bigl(x_i, \mu_i(x_i), w_i\bigr), \qquad i = k+1, \ldots, T-1,$$

where the first generated state is

$$x_{k+1} = f(x, u, w_k),$$

and each of the disturbances $w_k, \ldots, w_{T-1}$ is an independent random sample from the given
distribution. The costs corresponding to these trajectories are averaged to compute an approximation
$\tilde Q_k(x, u)$ to the Q-factor $Q_k(x, u)$ of Eq. (4.5). The approximation becomes increasingly
accurate as the number of simulated trajectories increases. Once the approximate Q-factor $\tilde Q_k(x, u)$
corresponding to each control $u \in U(x)$ is computed, we can obtain the (approximate) rollout
control $\tilde\mu_k(x)$ by the minimization

$$\tilde\mu_k(x) = \arg\min_{u \in U(x)} \tilde Q_k(x, u).$$
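The simulation scheme just described can be sketched as follows. This is only an illustration under stated assumptions: the toy dynamics, the fixed random seed, the number of trajectories, and the names `simulate_cost` and `monte_carlo_rollout` are hypothetical choices, not part of the paper.

```python
import random


def simulate_cost(x, u, k, T, f, g, mu, sample_w, rng):
    """Cost of one simulated trajectory: control u at time k, base policy mu afterwards."""
    total = 0.0
    for i in range(k, T):
        w = sample_w(x, u, rng)
        total += g(x, u, w)
        x = f(x, u, w)
        if i + 1 < T:
            u = mu(i + 1, x)  # base policy chooses the control for the next stage
    return total


def monte_carlo_rollout(x, k, T, controls, f, g, mu, sample_w, n_traj, rng):
    """Average n_traj trajectory costs per control and return the minimizing control."""
    q = {}
    for u in controls:
        costs = [simulate_cost(x, u, k, T, f, g, mu, sample_w, rng) for _ in range(n_traj)]
        q[u] = sum(costs) / n_traj
    return min(q, key=q.get), q


# Hypothetical toy problem: "safe" (u = 0) costs 2 and always finishes;
# "risky" (u = 1) costs 1 per attempt and finishes with probability 0.5.
def f(x, u, w):
    return 1 if x == 1 or u == 0 or w == 1 else 0


def g(x, u, w):
    if x == 1:
        return 0.0
    return 2.0 if u == 0 else 1.0


def mu(k, x):
    return 1  # base policy: always take the risky control


def sample_w(x, u, rng):
    return 1 if rng.random() < 0.5 else 0


rng = random.Random(0)  # fixed seed so the run is reproducible
best_u, q = monte_carlo_rollout(0, 0, 3, [0, 1], f, g, mu, sample_w, 20000, rng)
print(best_u, q)
```

For this toy problem the exact Q-factors are 2 for the safe control and 1.75 for the risky one, so the estimates can be checked by hand. The cost of the method is visible in the call: every control at every visited state requires `n_traj` full trajectories.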

Unfortunately, this method suffers from the excessive computational overhead of the Monte
Carlo simulation. We are thus motivated to consider approximations that involve reduced
overhead, and yet capture the essence of the basic rollout idea. We describe next an approximation
approach of this type, and in the following section, we discuss its application to stochastic
scheduling problems.

Approximation Using Scenaria

Let us suppose that we approximate the cost-to-go of the base policy $\pi$ using certainty equivalence.
In particular, given a state $x_k$ at time $k$, we fix the remaining disturbances at some nominal values
$\bar w_k, \bar w_{k+1}, \ldots, \bar w_{T-1}$, and we generate the associated state and control trajectory of the
system using the base policy $\pi$ starting from $x_k$ and time $k$. The corresponding cost is denoted by
$\tilde J_k(x_k)$, and is used as an approximation to the true cost $J_k(x_k)$. The approximate rollout control
based on $\pi$ is given by

$$\tilde\mu_k(x_k) = \arg\min_{u \in U(x_k)} E\bigl\{ g(x_k, u, w) + \tilde J_{k+1}\bigl(f(x_k, u, w)\bigr) \bigr\}.$$

We thus need to run $\pi$ from all possible next states $f(x_k, u, w)$ and evaluate the corresponding
approximate cost-to-go $\tilde J_{k+1}\bigl(f(x_k, u, w)\bigr)$ using a single state-control trajectory calculation based on
the nominal values of the uncertainty. The nominal disturbance sequence $\{\bar w_k, \bar w_{k+1}, \ldots, \bar w_{T-1}\}$
may be state-dependent, and in a practical setting, its choice is intended to capture “interesting
and representative” aspects of the problem’s uncertainty. This is hard to characterize precisely
in general, but it may be meaningful in specific contexts.
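A minimal sketch of certainty-equivalent rollout is given below. The expectation over the first disturbance is taken exactly, while the cost-to-go of the base policy from each next state is computed along one nominal disturbance sequence. The toy problem, the optimistic nominal sequence ("every attempt succeeds"), and the function names are assumptions made for illustration.

```python
def nominal_cost_to_go(x, k, T, f, g, mu, w_nominal):
    """Cost of the base policy from (x, k) along the single nominal disturbance sequence."""
    total = 0.0
    for i in range(k, T):
        u = mu(i, x)
        w = w_nominal[i]
        total += g(x, u, w)
        x = f(x, u, w)
    return total


def ce_rollout_control(x, k, T, controls, f, g, mu, disturbances, w_nominal):
    """Approximate rollout control: exact first stage, nominal trajectory afterwards."""
    q = {}
    for u in controls:
        q[u] = sum(
            p * (g(x, u, w) + nominal_cost_to_go(f(x, u, w), k + 1, T, f, g, mu, w_nominal))
            for w, p in disturbances(x, u)
        )
    return min(q, key=q.get), q


# Hypothetical toy problem: "safe" (u = 0) costs 2 and always finishes;
# "risky" (u = 1) costs 1 per attempt and finishes only when w = 1.
def f(x, u, w):
    return 1 if x == 1 or u == 0 or w == 1 else 0


def g(x, u, w):
    if x == 1:
        return 0.0
    return 2.0 if u == 0 else 1.0


def mu(k, x):
    return 1  # base policy: always take the risky control


def disturbances(x, u):
    return [(0, 0.5), (1, 0.5)]


w_nominal = [1, 1, 1]  # assumed nominal scenario: every attempt succeeds
best_u, q = ce_rollout_control(0, 0, 3, [0, 1], f, g, mu, disturbances, w_nominal)
print(best_u, q)
```

Each Q-factor now needs only one deterministic trajectory per next state. The price is a bias that depends on the nominal sequence: here the optimistic scenario gives 1.5 for the risky control, compared with the exact value 1.75.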

The certainty equivalent approximation involves a single nominal trajectory of the remaining
uncertainty. To strengthen this approach, it is natural to consider multiple trajectories of the
uncertainty, called scenaria, and to construct an approximation to the relevant Q-factors that
involves, for every one of the scenaria, the cost of the base policy $\pi$. Mathematically, we assume
that we have a method, which at each state $x_k$, generates $M$ uncertainty sequences

$$w^m(x_k) = \{w_k^m, w_{k+1}^m, \ldots, w_{T-1}^m\}, \qquad m = 1, \ldots, M.$$

The sequences $w^m(x_k)$ are the scenaria at state $x_k$. The cost $J_k(x_k)$ of the base policy is
approximated by

$$\tilde J_k(x_k, r) = r_0 + \sum_{m=1}^{M} r_m C_m(x_k), \tag{4.6}$$

where $r = (r_0, r_1, \ldots, r_M)$ is a vector of parameters to be determined, and $C_m(x_k)$ is the cost
corresponding to an occurrence of the scenario $w^m(x_k)$, when starting at state $x_k$ and using
the base policy. We may interpret the parameter $r_m$ as an “aggregate weight” that encodes
the aggregate effect on the cost-to-go function of uncertainty sequences that are similar to the
scenario $w^m(x_k)$. We will assume for simplicity that $r$ does not depend on the time index $k$ or
the state $x_k$. However, there are interesting possibilities for allowing a dependence of $r$ on $k$
and/or $x_k$, with straightforward changes in the following methodology. Note that, if $r_0 = 0$, the
approximation (4.6) may also be viewed as a limited simulation approach, based on just the $M$
scenaria $w^m(x_k)$, and using the weights $r_m$ as “aggregate probabilities.”
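The approximation (4.6) can be sketched in a few lines. The example assumes a hypothetical toy problem with three hand-picked scenaria and weights chosen as aggregate probabilities ($r_0 = 0$). The names `scenario_cost` and `scenario_cost_to_go` are illustrative.

```python
def scenario_cost(x, k, T, f, g, mu, scenario):
    """C_m(x_k): cost of the base policy from (x, k) when disturbances follow `scenario`."""
    total = 0.0
    for i in range(k, T):
        u = mu(i, x)
        w = scenario[i]
        total += g(x, u, w)
        x = f(x, u, w)
    return total


def scenario_cost_to_go(x, k, T, f, g, mu, scenarios, r):
    """Eq. (4.6): r[0] + sum over m of r[m] * C_m(x), with r = (r_0, r_1, ..., r_M)."""
    costs = [scenario_cost(x, k, T, f, g, mu, s) for s in scenarios]
    return r[0] + sum(r_m * c for r_m, c in zip(r[1:], costs))


# Hypothetical toy problem: each attempt from state 0 costs 1 and succeeds when w = 1.
def f(x, u, w):
    return 1 if x == 1 or (u == 1 and w == 1) else 0


def g(x, u, w):
    return 1.0 if x == 0 and u == 1 else 0.0


def mu(k, x):
    return 1  # base policy: always attempt


# Three assumed scenaria over T = 3 stages: success at the first, second, or third attempt.
scenarios = [[1, 1, 1], [0, 1, 1], [0, 0, 1]]
r = (0.0, 0.5, 0.25, 0.25)  # r_0 = 0 and weights acting as aggregate probabilities
approx = scenario_cost_to_go(0, 0, 3, f, g, mu, scenarios, r)
print(approx)
```

In this example the third weight aggregates every sequence in which the first two attempts fail, so the weighted sum happens to match the exact cost-to-go of the base policy. In general the weights would have to be tuned.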

Given the parameter vector $r$, and the corresponding approximation $\tilde J_k(x_k, r)$ to the cost
of the base policy, as defined above, a corresponding approximate rollout policy is determined by

$$\tilde\mu_k(x) = \arg\min_{u \in U(x)} \tilde Q_k(x, u, r), \tag{4.7}$$

where

$$\tilde Q_k(x, u, r) = E\bigl\{ g(x, u, w) + \tilde J_{k+1}\bigl(f(x, u, w), r\bigr) \bigr\} \tag{4.8}$$

is the approximate Q-factor. We envision here that the parameter $r$ will be determined by an
off-line “training” process and it will then be used for calculating on-line the approximate rollout
policy as above.

One may use standard methods of neuro-dynamic programming (NDP) to train the parameter vector $r$.
In particular, we may view the approximating function $\tilde J_k(x_k, r)$ of Eq. (4.6) as a linear
feature-based architecture where the scenaria costs $C_m(x_k)$ are the features at state $x_k$. One
possibility is to use a straightforward least squares fit of $\tilde J_k(x_k, r)$ to random sample values of
the cost-to-go $J_k(x_k)$. These sample values may be obtained by Monte Carlo simulation, starting from a
representative subset of states. Another possibility is to use Sutton’s TD($\lambda$). We refer to the books by
Bertsekas and Tsitsiklis [BeT96] and Barto and Sutton [BaS98], and the survey by Barto et al. [BBS95] for
extensive accounts of training methods and related techniques.
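To illustrate the least squares option, the sketch below fits the weights for the special case of a single scenario ($M = 1$), where the fit has a closed form. The training data are made up for the example, and the function name `fit_weights` is hypothetical. A general $M$ would call for a linear least squares solver instead.

```python
def fit_weights(features, targets):
    """Least squares fit of targets ~ r0 + r1 * features (single scenario, M = 1).

    features[j] plays the role of the scenario cost C_1(x) at the j-th sample state,
    targets[j] the sampled cost-to-go J(x) at the same state.
    Assumes the feature values are not all equal (otherwise sxx is zero).
    """
    n = len(features)
    mean_c = sum(features) / n
    mean_j = sum(targets) / n
    sxx = sum((c - mean_c) ** 2 for c in features)
    sxy = sum((c - mean_c) * (j - mean_j) for c, j in zip(features, targets))
    r1 = sxy / sxx
    r0 = mean_j - r1 * mean_c
    return r0, r1


# Made-up training data: scenario costs and simulated cost-to-go samples.
C = [1.0, 2.0, 3.0, 4.0]
J = [1.5, 2.5, 3.5, 4.5]
r0, r1 = fit_weights(C, J)
print(r0, r1)
```

The fitted pair `(r0, r1)` would then be used on-line in Eq. (4.6) to score each candidate control.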

We finally mention a variation of the scenario-based approximation method, whereby partial
scenaria are used. In particular, only a portion of the future uncertain quantities are fixed at
nominal scenario values, while the remaining uncertain quantities are explicitly viewed as random.
The cost of scenario $m$ at state $x_k$ is now a random variable, and the quantity $C_m(x_k)$ used in
Eq. (4.6) should be the expected value of this random variable. This variation is appropriate and
makes practical sense as long as the computation of the corresponding expected scenaria costs
$C_m(x_k)$ is convenient.


5.  ROLLOUT  ALGORITHMS  FOR  STOCHASTIC  QUIZ  PROBLEMS 


We now apply the rollout approach based on certainty equivalence and scenaria to variants of
the quiz problem where there is no optimal policy that is open-loop, such as the situations (e)-(i)
given in Section 1. The state after questions $i_1, \ldots, i_k$ have been successfully answered is the
current partial schedule $(i_1, \ldots, i_k)$, and possibly the list of surviving quiz takers [in the case
where there are multiple quiz takers, as in variant (g) of Section 1]. A (partial) scenario at
this state corresponds to a (deterministic) sequence of realizations of some of the future random
quantities, such as:

(1)  The  list  of  turns  that  will  be  missed  in  answer  attempts  from  time  k  onward;  this  is  for  the 
case  of  variant  (e)  in  Section  1. 

(2)  The  list  of  new  questions  that  will  appear  and  old  questions  that  will  disappear  from  time 
k  onward;  this  is  for  the  case  of  variant  (f)  in  Section  1. 

(3) The specific future times at which the surviving quiz takers will drop out of the quiz; this
is for the case of variant (g) in Section 1.

Given any scenario of this type at a given state, and a base heuristic such as an index or
a greedy policy, the corresponding value of the heuristic [cf. the cost $C_m(x_k)$ in Eq. (4.6)] can
be easily calculated. The approximate value of the heuristic at the given state can be computed
by weighting the values of all the scenaria using a weight vector $r$, as in Eq. (4.6). In the case of
a single scenario, a form of certainty equivalence is used, whereby the value of the scenario at a
given state is used as the (approximate) value of the heuristic starting from that state. In the
next section we present computational results for a problem that is identical to the
one tested in Section 3, except that a turn may be missed with a certain probability.


6. COMPUTATIONAL EXPERIMENTS WITH STOCHASTIC QUIZ PROBLEMS


The class of quiz problems which we used in our computational experiments is similar to the
problems used in Section 3, with the additional feature that an attempt to answer a question
can be blocked with a prespecified probability, corresponding to the case of variant (e) in Section
1. The problems involve 20 questions and 20 time periods, where each question has a prescribed
set of times where it can be attempted. The result of a blocking event is a loss of opportunity
to answer any question at that stage. Unanswered questions can be attempted in future stages,
until a wrong answer is obtained.

In order to evaluate the performance of the base policy for rollout algorithms, we use a
single partial scenario version of the approach described in the preceding section. Assume that
the blocking probability is denoted by $P_b$. For an $M$-stage problem, at any stage $k$, we compute
an “equivalent” scenario duration $T_e$ as the smallest integer greater than or equal to the expected
number of remaining stages where there will be no blocking. The number of remaining stages is
$M - k$, and the probability of no blocking in each one of them is $1 - P_b$, so we have

$$T_e = \lceil (1 - P_b)(M - k) \rceil.$$

At a given state and stage $k$, the expected reward of a base heuristic for the stochastic quiz
problem is approximated, using Eq. (2.1), as the expected reward obtained using the heuristic in
a deterministic quiz problem starting with the given state, with remaining duration $T_e$ (rather
than $M - k$).
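The equivalent duration is a one-line computation. The helper below is a sketch, and the function and argument names are assumptions for illustration.

```python
import math


def equivalent_duration(p_block, total_stages, k):
    """Smallest integer >= expected number of unblocked stages among the remaining ones."""
    return math.ceil((1.0 - p_block) * (total_stages - k))


# With blocking probability 0.5, M = 20 stages and k = 3, the expected number
# of unblocked remaining stages is 0.5 * 17 = 8.5, which rounds up to 9.
print(equivalent_duration(0.5, 20, 3))  # 9
```

When the product is very close to an integer, floating-point rounding can affect the ceiling. Exact rational arithmetic (for example Python's `fractions.Fraction`) avoids that if it matters.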

As in Section 3, we used seven algorithms in our experiments:

(1) The optimal stochastic dynamic programming algorithm.

(2) The greedy heuristic, where questions are ranked in decreasing $p_i v_i$, and, for each stage $k$,
the feasible unanswered question with the highest ranking is selected.

(3) The index heuristic, where questions are ranked by decreasing $p_i v_i/(1 - p_i)$, and for each
stage $k$, the feasible unanswered question with the highest ranking is selected.

(4) The one-step rollout policy based on the greedy heuristic and certainty equivalence policy
evaluation, where, at each stage $k$, for every feasible unanswered question $i_k$ and prior
sequence $i_1, \ldots, i_{k-1}$, the question is chosen according to the rollout rule (2.4). The function
$H$ uses the greedy heuristic as the base policy, and its performance is approximated by the
performance of an equivalent non-blocking quiz problem as described above.

(5) The one-step rollout policy based on the index heuristic and certainty equivalence policy
evaluation, where the function $H$ in (2.4) uses the index heuristic as the base policy, and is
approximated using the certainty equivalence approach described previously.

(6) The selective two-step lookahead rollout policy based on the greedy heuristic, with certainty
equivalence policy evaluation corresponding to an equivalent non-blocking quiz problem with
horizon described as above.

(7) The selective two-step lookahead rollout policy based on the index heuristic, with certainty
equivalence policy evaluation corresponding to an equivalent non-blocking quiz problem
with horizon described as above.
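The two base heuristics in items (2) and (3) differ only in the ranking key. The sketch below shows both rankings and the per-stage selection among feasible unanswered questions. The data and function names are hypothetical, and a question with $p_i = 1$ is given an infinite index by assumption.

```python
def greedy_ranking(p, v):
    """Questions sorted by decreasing expected immediate reward p_i * v_i."""
    return sorted(range(len(p)), key=lambda i: p[i] * v[i], reverse=True)


def index_ranking(p, v):
    """Questions sorted by decreasing index p_i * v_i / (1 - p_i)."""
    def index(i):
        # Assumption: a question that cannot be missed gets an infinite index.
        return float("inf") if p[i] >= 1.0 else p[i] * v[i] / (1.0 - p[i])
    return sorted(range(len(p)), key=index, reverse=True)


def select_question(ranking, feasible, answered):
    """Highest-ranked question that is feasible at this stage and not yet answered."""
    for i in ranking:
        if i in feasible and i not in answered:
            return i
    return None  # no admissible question at this stage


# Hypothetical data: a safe low-value question, a medium one, a risky high-value one.
p = [0.9, 0.5, 0.2]
v = [1.0, 3.0, 10.0]
greedy = greedy_ranking(p, v)
index = index_ranking(p, v)
print(greedy, index)
print(select_question(greedy, {0, 1}, set()), select_question(index, {0, 1}, set()))
```

On this data the two rankings are exact opposites: the greedy heuristic favors the risky high-value question, while the index heuristic postpones it until the safer questions have been attempted.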

The problems selected for evaluation involve 20 possible questions and 20 stages, which are
small enough so that exact solution using dynamic programming is possible. Associated with each
question is a sequence of times, determined randomly for each experiment, when that question
can be attempted. In each problem instance, each question was randomly assigned a floating-point
value between 1 and 10. The probabilities of successfully answering each question were also chosen
randomly, between a specified lower bound and 1.0. In order to evaluate the performance of the
last six algorithms, each suboptimal algorithm was simulated 10,000 times, using independent
event sequences determining which question attempts were blocked and which questions were
answered correctly.

Our  experiments  focused  on  the  effects  of  three  factors  on  the  relative  performance  of  the 
different  algorithms: 

(a) The lower bound on the probability of successfully answering a question, which varied from
0.2 to 0.8.

(b) The average percent of admissible questions at any one stage, which ranged from 10% to
50%.

(c) The probability $1 - P_b$ that individual question attempts will not be blocked, ranging from
0.3 to 1.0.

As in Section 3, for each experimental condition, we generated 30 independent problems
and solved them with each of the 7 algorithms, and evaluated the corresponding performance using
10,000 Monte Carlo runs. The average performance is reported for each condition.

The first set of experiments fixed the average percentage of admissible questions at a single
stage to 10%, the probability that question attempts will not be blocked to 0.6, and varied the
lower bound on the probability of successfully answering a question across four conditions: 0.2,
0.4, 0.6 and 0.8. Table 3 shows the results of our experiments. The average performance of the
greedy and index heuristics in each condition is expressed in terms of the percentage of the
optimal performance. The results for this experiment are very similar to the results we obtained
earlier for deterministic quiz problems. Without rollouts, the performance of either heuristic is
poor, whereas the use of one-step rollouts can recover a significant percentage of the optimal
performance. As the risk associated with answering questions decreases, the performance of the
heuristics improves, and the resulting improvement offered by the use of rollouts decreases. On
average, the advantage of using selective two-step rollouts is small, but this advantage can be
large for selected difficult problems.


Minimum Probability of Success           0.2     0.4     0.6     0.8
--------------------------------------------------------------------
Greedy Heuristic                         54%     63%     73%     82%
  Improvement by One-step Rollout        31%     26%     17%      6%
  Improvement by Two-step Rollout        33%     26%     17%      6%
Index Heuristic                          56%     67%     78%     84%
  Improvement by One-step Rollout        30%     22%     12%      4%
  Improvement by Two-step Rollout        31%     23%     12%      4%

Table 3: Performance of the different algorithms for stochastic quiz problems
as the minimum probability of success of answering a question varies. The average
percentage of admissible questions at a single stage and the probability
that question attempts will not be blocked are fixed at 10% and 0.6, respectively.
The numbers reported are percentage of the performance of the optimal dynamic
programming solution achieved, averaged across 30 independent problems.

The  second  set  of  experiments  fixed  the  lower  bound  on  the  probability  of  successfully 
answering  a  question  to  0.2,  and  varied  the  average  percent  of  admissible  questions  at  any  one 
stage  across  3  levels:  10%,  30%  and  50%.  The  results  of  these  experiments  are  summarized 
in Table 4. As in the deterministic quiz problems, the performance of the greedy and index
heuristics improves as the percentage of admissible questions at any one stage approaches 100%.
The  results  also  show  that,  even  in  cases  where  the  heuristics  achieve  good  performance,  rollout 
strategies  offer  significant  performance  gains. 

Problem Density                          0.1     0.3     0.5
------------------------------------------------------------
Greedy Heuristic                         54%     65%     78%
  Improvement by One-step Rollout        31%     23%     13%
  Improvement by Two-step Rollout        33%     24%     13%
Index Heuristic                          56%     74%     87%
  Improvement by One-step Rollout        30%     15%      5%
  Improvement by Two-step Rollout        31%     16%      5%

Table 4: Performance of the different algorithms on stochastic quiz problems
as the average number of questions per period increases. The lower bound on the
probability of successfully answering a question and the probability that question
attempts will not be blocked are fixed at 0.2 and 0.6, respectively. The numbers
reported are percentage of the performance of the optimal dynamic programming
solution achieved, averaged across 30 independent problems.

The last set of experiments fixed the lower bound on the probability of successfully answering
a question to 0.2, and varied the probability $1 - P_b$ that an attempt to answer a question
at any one time is not blocked over 3 conditions: 0.3, 0.6 and 1.0. The last condition corresponds
to the deterministic quiz problems of Section 3. Table 5 contains the results of these experiments.
As the blocking probability increases, there is increased randomness as to whether questions may
be available in the future. This increased randomness leads to improved performance of myopic
strategies, as shown in Table 5. Again, the advantages of the rollout strategies are evident even
in this favorable case.

Probability of Non-Blocking              0.3     0.6       1
------------------------------------------------------------
Greedy Heuristic                         73%     54%     41%
  Improvement by One-step Rollout        17%     31%     34%
  Improvement by Two-step Rollout        18%     33%     40%
Index Heuristic                          75%     56%     43%
  Improvement by One-step Rollout        16%     30%     34%
  Improvement by Two-step Rollout        16%     31%     38%

Table 5: Performance of the different algorithms on stochastic quiz problems
as the probability of non-blocking increases. The average percentage of admissible
questions at a single stage and the lower bound on the probability of successfully
answering a question are fixed at 10% and 0.2, respectively. The numbers reported
are percentage of the performance of the optimal dynamic programming solution
achieved, averaged across 30 independent problems.

The results in Tables 3, 4 and 5 provide ample evidence that rollout strategies enhance
substantially the performance of heuristics for stochastic quiz problems, while maintaining
polynomial solution complexity.


7. QUIZ PROBLEMS WITH GRAPH PRECEDENCE CONSTRAINTS


The previous set of experiments focused on quiz problems where questions could be attempted
during specific time periods, with no constraints imposed on the questions which had been
attempted previously. In order to study the effectiveness of rollout strategies for stochastic
scheduling problems with precedence constraints, we defined a class of quiz problems where the sequence
of questions to be attempted must form a connected path in a graph. In these problems, a
question cannot be blocked as in the problems of Section 6, so there exists an optimal open-loop
policy.

Let $\mathcal{G} = (\mathcal{N}, \mathcal{A})$ be a directed graph where the nodes $\mathcal{N}$ represent questions in a quiz
problem. Associated with each node $n$ is a value for answering the question correctly, $v_n$, and a
probability of correctly answering the question, $p_n$. Once a question has been answered correctly
at node $n$, the value of subsequent visits to node $n$ is reduced to zero, and there is no risk of
failure on subsequent visits to node $n$.

The graph constrains the quiz problem as follows: a question $m$ may be attempted at stage
$k$ only if there is an arc $(n, m) \in \mathcal{A}$, where $n$ is the question attempted at stage $k - 1$. The
graph-constrained quiz problem of duration $N$ consists of finding a path $n_0, n_1, \ldots, n_N$ in the
graph $\mathcal{G}$ such that $n_0$ is the fixed starting node, $(n_k, n_{k+1}) \in \mathcal{A}$ for all $k = 0, \ldots, N - 1$, and the
path maximizes the expected value of the questions answered correctly before the first erroneous
answer.

The previous heuristic algorithms can be extended to the graph-constrained case. The
greedy heuristic can be described as follows: Given that the current attempted question was $n$,
determine the feasible questions $i$ such that $(n, i) \in \mathcal{A}$. Select the feasible question which has the
highest expected value for the next attempt, $p_i v_i$. In the graph-constrained problem, it is possible
that there are no feasible questions with positive value, and the path is forced to revisit a question
already answered. If no feasible question has positive value, the greedy heuristic is modified to
select a feasible node which has been visited the least number of times among the feasible nodes
from node $n$. The index heuristic is defined similarly, except that the index $p_i v_i/(1 - p_i)$ is
used to rank the feasible questions.

One-step rollout policies can be based on the greedy or index heuristics, as before. Since
the class of problems is similar to the deterministic quiz problems discussed earlier, it is
straightforward to determine the expected value associated with a given policy. The rollout policies are
based on exact evaluation of these expected values.
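A compact sketch of one-step rollout on a graph-constrained problem follows, with the greedy heuristic (including its least-visited fallback) as base policy and exact evaluation of expected values. Several details are assumptions made for the example rather than statements from the paper: the starting node is treated as already answered, every node has at least one outgoing arc, and the four-node graph and all function names are hypothetical.

```python
def greedy_step(current, succ, p, v, answered, visits):
    """Greedy choice among successors, falling back to the least-visited successor
    when no feasible question has positive expected value."""
    feasible = succ[current]
    fresh = [n for n in feasible if n not in answered and p[n] * v[n] > 0]
    if fresh:
        return max(fresh, key=lambda n: p[n] * v[n])
    return min(feasible, key=lambda n: visits.get(n, 0))


def heuristic_value(current, steps, succ, p, v, answered, visits):
    """Exact expected value of following the greedy heuristic for `steps` more attempts."""
    answered, visits = set(answered), dict(visits)
    survive, total = 1.0, 0.0  # survive = probability of no wrong answer so far
    for _ in range(steps):
        n = greedy_step(current, succ, p, v, answered, visits)
        visits[n] = visits.get(n, 0) + 1
        if n not in answered:  # revisits carry neither value nor risk
            total += survive * p[n] * v[n]
            survive *= p[n]
            answered.add(n)
        current = n
    return total


def rollout_step(current, steps, succ, p, v, answered, visits):
    """One-step rollout: try each feasible first move, complete the path greedily,
    and keep the move with the largest expected value."""
    best, best_val = None, float("-inf")
    for n in succ[current]:
        a, vis = set(answered), dict(visits)
        vis[n] = vis.get(n, 0) + 1
        if n in a:
            gain, cont = 0.0, 1.0
        else:
            gain, cont = p[n] * v[n], p[n]
            a.add(n)
        val = gain + cont * heuristic_value(n, steps - 1, succ, p, v, a, vis)
        if val > best_val:
            best, best_val = n, val
    return best, best_val


# Hypothetical graph: node 1 looks attractive but leads back to the start,
# node 2 is modest but opens the way to the valuable node 3.
succ = {0: [1, 2], 1: [0], 2: [3], 3: [2]}
p = {0: 1.0, 1: 0.5, 2: 0.9, 3: 0.9}
v = {0: 0.0, 1: 6.0, 2: 2.0, 3: 10.0}
base_value = heuristic_value(0, 2, succ, p, v, {0}, {})
best, best_value = rollout_step(0, 2, succ, p, v, {0}, {})
print(base_value, best, best_value)
```

In this example the greedy heuristic moves to node 1 and collects an expected value of 3, while the rollout step moves to node 2 because the greedy completion from there reaches node 3.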

In  the  experiments  below,  we  compare  the  following  five  algorithms: 

(1)  The  optimal  dynamic  programming  algorithm. 

(2)  The  greedy  policy. 

(3)  The  index  policy. 

(4)  The  one-step  rollout  policy  based  on  the  greedy  heuristic. 

(5)  The  one-step  rollout  policy  based  on  the  index  heuristic. 

The  first  set  of  experiments  involves  problems  with  16  questions  and  16  stages.  This  problem 
size  is  small  enough  to  permit  exact  solution  using  the  dynamic  programming  algorithm.  The 
questions  were  valued  from  1  to  10,  selected  randomly.  On  average,  each  node  was  connected  to  5 
other  nodes,  corresponding  to  30%  density.  In  these  experiments,  the  probability  of  successfully 
answering  a  question  was  randomly  selected  between  a  lower  bound  and  1.0,  and  the  lower  bound 
was  varied  from  0.2  to  0.8,  thereby  varying  the  average  risk  associated  with  a  problem. 

Table 6 summarizes the results of these experiments. The first observation is that the
performance of the heuristics in graph-constrained problems is superior to the performance
obtained in the experiments in Section 3. This is due in part to the lack of structure
concerning when questions could be attempted in the problems tested in Section 3. In contrast,
the graph structure in this section provides a time-invariant set of constraints, leading to better
performance. In spite of this improved performance, the results show that rollout algorithms can
improve the performance of the heuristics, to levels where the achieved performance is roughly
95% of the performance of the optimal dynamic programming algorithm, with a significant
reduction in computation cost compared with the optimal algorithm.


Minimum Probability of Success           0.2     0.4     0.6     0.8
--------------------------------------------------------------------
Greedy Heuristic                         74%     77%     77%     84%
  Improvement by One-step Rollout        20%     17%     14%     10%
Index Heuristic                          84%     87%     89%     90%
  Improvement by One-step Rollout        11%      9%      7%      5%

Table 6: Performance of the different algorithms on graph-constrained quiz
problems as the minimum probability of success of answering a question increases.
The probability of successfully answering a question was randomly selected between
a lower bound and 1.0, and the lower bound was varied from 0.2 to 0.8.
The numbers reported are percentage of the performance of the optimal dynamic
programming solution achieved, averaged across 30 independent problems.

To  illustrate  the  performance  of  rollout  algorithms  on  larger  problems,  we  ran  experiments 
on  graphs  involving  100  questions  and  100  stages.  For  problems  of  this  size,  exact  solution  via 
dynamic  programming  is  computationally  infeasible.  The  problems  involved  graphs  with  10% 
density and varying risks as before. The results are summarized in Table 7. Since no
optimal solution is available for reference, the results report the average improvement of the
rollout strategies over the corresponding heuristics, expressed as a percentage of the performance
achieved by the rollout strategies. The average improvement achieved by the rollout algorithms, as
shown in Table 7, is consistent with the corresponding improvement shown in Table 6. The results
indicate that rollout strategies continue to offer significant performance advantages over the
corresponding heuristics. In contrast with the optimal dynamic programming algorithm, the average
computation time for these problems when using rollout algorithms is a fraction of a second on a
Sun HyperSparc workstation.

Minimum Probability of Success                  0.2    0.4    0.6    0.8

Improvement over Greedy by One-step Rollout     28%    29%    31%    24%
Improvement over Index by One-step Rollout      13%    12%    10%     6%

Table 7: Performance improvement achieved by rollout algorithms over the
corresponding heuristics on 100-question graph-constrained quiz problems as the
minimum probability of success of answering a question increases. The numbers
reported are percentage of the performance of the rollout algorithms, averaged
across 30 independent problems.
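
Tables 6 and 7 use different normalizations, so a small calculation helps when comparing them. The following Python sketch converts the Table 6 entries to the Table 7 convention. It assumes that the "Improvement by One-step Rollout" rows of Table 6 are percentage points of the optimal performance, so that the rollout performance is the heuristic performance plus the improvement; the function name and the dictionary layout are illustrative only.

```python
# Table 6 entries, in percent of the optimal dynamic programming performance.
# Each tuple is (heuristic performance, improvement by one-step rollout).
TABLE_6 = {
    0.2: {"greedy": (74, 20), "index": (84, 11)},
    0.4: {"greedy": (77, 17), "index": (87, 9)},
    0.6: {"greedy": (77, 14), "index": (89, 7)},
    0.8: {"greedy": (84, 10), "index": (90, 5)},
}


def improvement_relative_to_rollout(heuristic_pct, gain_pct):
    """Express the rollout gain as a percentage of the rollout performance."""
    # Assumption: the gain is additive in percentage points of optimal.
    rollout_pct = heuristic_pct + gain_pct
    return 100.0 * gain_pct / rollout_pct


for p_min, row in TABLE_6.items():
    greedy = improvement_relative_to_rollout(*row["greedy"])
    index = improvement_relative_to_rollout(*row["index"])
    print(f"min probability {p_min}: greedy {greedy:.1f}%, index {index:.1f}%")
```

Under this reading, the 0.2 column of Table 6 corresponds to improvements of roughly 21% over the greedy heuristic and 12% over the index heuristic, which can be set against the 28% and 13% reported in Table 7 for the 100-question problems.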

8.  CONCLUSION 


In  this  paper,  we  studied  stochastic  scheduling  problems  arising  from  variations  of  a  classical 
search  problem  known  as  a  quiz  problem.  We  grouped  these  variations  into  two  classes:  the 
deterministic quiz problems, for which optimal strategies can be expressed as deterministic
sequences, and the stochastic quiz problems, for which optimal strategies are feedback functions
of the problem state. For either of these classes, the computational complexity of obtaining
exact optimal solutions grows exponentially with the size of the scheduling problem, limiting the
applicability  of  exact  techniques  such  as  stochastic  dynamic  programming. 

We developed computationally tractable, near-optimal solution approaches for deterministic and
stochastic quiz problems, based on the use of rollout algorithms. For
stochastic quiz problems, we introduced a novel approach to policy evaluation, based on the
use of scenaria, which resulted in polynomial complexity algorithms for obtaining near-optimal
strategies. Our computational experiments show that these rollout algorithms can substantially
improve the performance of index-based and greedy algorithms for both deterministic and
stochastic quiz problems.
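
As a concrete illustration of the one-step rollout idea, the following Python sketch applies it to a small deterministic quiz problem. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: there are no graph constraints, the number of stages may be smaller than the number of questions, the base heuristic is a greedy rule that ranks questions by p[i] * v[i], and all function names are hypothetical.

```python
def expected_reward(order, p, v):
    """Expected reward of attempting questions in `order`, stopping at the first failure."""
    total, survive = 0.0, 1.0
    for i in order:
        survive *= p[i]          # probability that every question so far was answered
        total += survive * v[i]  # reward v[i] is collected only in that event
    return total


def greedy_order(remaining, p, v, stages):
    """Base heuristic (assumed): take questions in decreasing order of p[i] * v[i]."""
    # Sorting the indices first makes tie-breaking deterministic.
    ranked = sorted(sorted(remaining), key=lambda i: p[i] * v[i], reverse=True)
    return ranked[:stages]


def rollout_order(p, v, stages):
    """One-step rollout: try each next question, complete with the heuristic, keep the best."""
    chosen, remaining = [], set(range(len(p)))
    for k in range(stages):
        if not remaining:
            break
        best_i, best_val = None, float("-inf")
        for i in sorted(remaining):
            tail = greedy_order(remaining - {i}, p, v, stages - k - 1)
            val = expected_reward(chosen + [i] + tail, p, v)
            if val > best_val:
                best_i, best_val = i, val
        chosen.append(best_i)
        remaining.remove(best_i)
    return chosen


if __name__ == "__main__":
    p = [0.9, 0.5, 0.3]   # success probabilities (illustrative values)
    v = [2.0, 4.0, 10.0]  # rewards (illustrative values)
    base = greedy_order(range(len(p)), p, v, 2)
    roll = rollout_order(p, v, 2)
    print(base, expected_reward(base, p, v))
    print(roll, expected_reward(roll, p, v))
```

Because this greedy rule, with deterministic tie-breaking, continues its own sequence when restarted from any point along it, the rollout sequence in this sketch never has a lower expected reward than the greedy sequence.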